About the project


This course gives an overview of the possibilities of open science, open data science and open research tools. During the course, students analyze and visualize data with statistical methods using open software tools.

Click here for my GitHub repository


Linear regression (IODS, week 2)

  1. Open the data:
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
dim(lrn14)
## [1] 166   7

The data was collected in the Introduction to Social Statistics course in fall 2014. The sample size is N = 166. The survey was originally conducted in Finnish. The data contains the background characteristics age, gender and exam points of the students who answered the questionnaire. Students answered questions related to their learning approaches and their attitude towards statistics on a Likert scale (1 to 5). Observations where the exam points variable is zero are excluded.

The variable attitude is the mean of the questions related to students' attitude towards statistics. The variables stra, deep and surf are means of sets of questions related to learning strategies. Deep measures how deeply a student tries to understand statistics. Stra measures how a student organizes studying and manages time. Surf measures surface learning, i.e. whether a student just memorizes things without context, or does not find a purpose for studying the topic.

Some background characteristics are omitted from the data. The data does not contain many missing values. The meta text can be found here.

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The data was collected to see how different variables affect exam points (a proxy for learning). Points and attitude have the highest correlation in the matrix above. In absolute value, points is clearly correlated with surf and stra, but not much correlated with age or deep learning. Attitude is highly correlated with points, indicating that it may be a very important factor for exam points. However, a high correlation does not imply a causal relationship between the variables.

Age is not highly correlated with any of the variables in the sample. Attitude also has clearly lower correlations with the other variables than with points (0.437 for points, less than 0.2 in absolute value for the others).

The data contains a larger portion of females than males. On average, males have a higher value for attitude than females. Males also have, on average, lower values for surf (surface learning). For the other variables, the distributions are quite similar for men and women. Males who participated in the course may be more motivated to learn statistics than females.

# create a plot matrix with ggpairs()
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))

Since exam points is the target (dependent) variable, we should choose the explanatory variables with the highest absolute correlation with exam points. Points correlates most strongly, in absolute value, with attitude (0.437), stra (0.146) and surf (0.144).
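This selection rule can be sketched on toy data (a hypothetical stand-in for lrn14, since the rule itself only needs a target column and candidate columns):

```r
# sketch: rank candidate explanatory variables by |correlation| with the target
set.seed(1)
attitude <- rnorm(100)
stra     <- rnorm(100)
points   <- 3.5 * attitude + rnorm(100, sd = 5)  # here attitude drives points
toy <- data.frame(points, attitude, stra)

cors <- sapply(c("attitude", "stra"), function(v) cor(toy$points, toy[[v]]))
best <- names(which.max(abs(cors)))
best  # the strongest candidate by absolute correlation
```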

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = lrn14)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The coefficient of stra is 0.8531, indicating a positive relationship between good learning strategies and exam points. Surf has a coefficient of -0.5861, indicating that surface learning could have a negative impact on exam scores. However, the explanatory variables stra and surf are not statistically significant at the 10% significance level, with p-values 0.11716 and 0.46563. Thus, I remove these variables from the model. I also plot the simple model after removing these variables.

# a scatter plot of points versus attitude
qplot(attitude, points, data = lrn14, col = gender) + geom_smooth(method = "lm")

From the plot we can clearly see that attitude towards statistics is positively correlated with exam points. There may be a few outliers in the bottom-right corner of the plot (a few people with low exam points despite a positive attitude towards learning statistics).

# fit a linear model
my_model <- lm(points ~ attitude, data = lrn14)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The explanatory variable attitude is statistically significant at the 1% significance level (p-value 4.12e-09).

The coefficient of attitude is 3.5255 (from the summary above). It indicates that if the value of attitude increases by 1 (the student has a more positive attitude towards studying statistics), her exam score is expected to be 3.5255 points higher. Attitude has a positive statistical relationship with the dependent variable (exam score).
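As a quick arithmetic check of this interpretation, using the intercept and slope reported in the summary above:

```r
# fitted line from the summary: points = 11.6372 + 3.5255 * attitude
pred <- function(attitude) 11.6372 + 3.5255 * attitude

# a one-unit increase in attitude raises the predicted exam score by the slope
delta <- pred(3) - pred(2)
delta  # 3.5255 (up to floating point)
```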

Multiple R-squared is 0.1906 and adjusted R-squared is 0.1856. Both of these values are low, so the model does not account for much of the variance in the data. However, low R-squared values do not mean that the model is inadequate, and such low values are common in models that attempt to predict human behavior. The statistically significant positive relationship between the explanatory variable (attitude) and the dependent variable (exam points) is the main finding here. The adjusted R-squared is almost the same as in the previous model with three variables, indicating that the fit of the model remains about the same.
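The relationship between the two reported values can be checked with the adjusted R-squared formula (n = 166 observations, p = 1 explanatory variable; the tiny mismatch comes from rounding in the reported multiple R-squared):

```r
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.1906
n  <- 166
p  <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # close to the reported 0.1856
```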

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

To use a linear regression model, one should check that the error term is normally distributed with constant variance and an expected value of zero. The error terms should not be correlated, and the error term should not depend on the explanatory variables.

In the first plot (Residuals vs Fitted), the points show a reasonably random spread. This supports the assumption that the size of the errors does not depend on the explanatory variables (the constant variance assumption). The second plot (Normal Q-Q) shows a reasonable fit, since the points fall close to the line. Hence, it is safe to assume that the errors of the model are normally distributed. In the third plot (Residuals vs Leverage), no single observation stands out. Therefore, there are no outliers that should be removed. In conclusion, the assumptions of the linear regression model hold, and the model seems adequate.


Logistic regression:

library(tidyr); library(dplyr); library(ggplot2); library(GGally)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  1. Open data:
alc <-  read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep= ",", header= TRUE)

Take a look at the data:

names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 382  35

The data contains 35 variables and 382 observations. It measures secondary education achievement and alcohol consumption in two Portuguese schools. The data combines two questionnaires: one from a mathematics course and one from a Portuguese language course. The observations were joined on factors common to both data sets, in order to study the relationship between educational achievement and alcohol consumption.

The meta text can be found here.

  1. Let's look at the data to choose four variables for studying their relationship with high/low alcohol consumption.
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

I chose a list of variables that I want to consider.

cor(alc$high_use, as.numeric(alc$sex))
## [1] 0.1962526

Sex is probably correlated with alcohol consumption; I would assume that males drink more than females. The correlation between high use and sex is 0.2.

cor(alc$high_use, alc$goout)
## [1] 0.352999

Going out may also increase the probability of drinking, and it is highly correlated with high use (0.35). I would guess that there is a causal relationship.

However, I should not use failures together with absences, since these two are not independent (absences may cause failures). I will also drop health, since bad health may be caused by drinking.

cor(alc$high_use, alc$G3)
## [1] -0.03487925
cor(alc$high_use, alc$absences)
## [1] 0.2229386

So I end up with the following list:

- absences (positive correlation 0.22 with high use)
- sex (positive correlation 0.20 for males)
- going out (positive correlation 0.35 with high use)
- G3 (negative correlation -0.03 with high use)

g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar() + ggtitle("Alcohol use by sex")

Higher alcohol use is clearly more common for males.

g2 <- ggplot(alc, aes(goout, fill = sex))

# draw a bar plot of high_use by sex
g2 + facet_wrap("sex") + geom_bar()

There is no big difference in going out between the sexes. It seems that more males than females go out often, but on average the difference is small.

# initialize a plot of high_use and goout
g3 <- ggplot(alc, aes(x = high_use, y = goout, col = sex))

# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("going out") + ggtitle("Going out by alcohol consumption and sex") 

Going out clearly correlates positively with high use.

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") + ggtitle("Grades by alcohol consumption and sex")

For males, high use of alcohol seems to have a negative effect on the grade, but for females the effect is very small.

# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

High use of alcohol is clearly correlated with absences.

The boxplots support all the hypotheses I made previously about the relationships (correlations). However, it is not possible to draw conclusions about causality from these descriptive plots.

Logistic regression (high / low alcohol consumption)

m <- glm(high_use ~ famrel + goout + absences + sex, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8278  -0.7554  -0.4827   0.7282   2.5976  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.69810    0.65848  -4.097 4.18e-05 ***
## famrel      -0.44809    0.14052  -3.189  0.00143 ** 
## goout        0.79976    0.12510   6.393 1.63e-10 ***
## absences     0.06331    0.01627   3.891 9.96e-05 ***
## sexM         1.06742    0.26378   4.047 5.20e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 371.08  on 377  degrees of freedom
## AIC: 381.08
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel       goout    absences        sexM 
## -2.69810044 -0.44808654  0.79975625  0.06330634  1.06742473

To find the most parsimonious model, I use the same approach as Vehkalahti and Everitt (Multivariate Analysis for the Behavioral Sciences, 2nd Edition, Chapman and Hall/CRC, Boca Raton, Florida, 2019; see course materials): I remove one variable at a time from the model and select the model with the lowest AIC.

First, I remove sex:

m <- glm(high_use ~ famrel + goout + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8808  -0.7599  -0.5297   0.8415   2.5030  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.28530    0.62653  -3.648 0.000265 ***
## famrel      -0.40001    0.13505  -2.962 0.003057 ** 
## goout        0.79744    0.12153   6.562 5.33e-11 ***
## absences     0.05691    0.01614   3.526 0.000422 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 388.31  on 378  degrees of freedom
## AIC: 396.31
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel       goout    absences 
## -2.28529556 -0.40001032  0.79744487  0.05691436

Second, I remove absences:

m <- glm(high_use ~ famrel + goout + sex, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5782  -0.7773  -0.5086   0.8240   2.5739  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2055     0.6284  -3.510 0.000448 ***
## famrel       -0.4689     0.1381  -3.397 0.000682 ***
## goout         0.8057     0.1217   6.620  3.6e-11 ***
## sexM          0.9588     0.2541   3.773 0.000162 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 387.54  on 378  degrees of freedom
## AIC: 395.54
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel       goout        sexM 
##  -2.2054721  -0.4689221   0.8056623   0.9587523

Third, I remove goout:

m <- glm(high_use ~ famrel + absences + sex, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6402  -0.8524  -0.6013   1.0729   2.1499  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.59731    0.51994  -1.149   0.2506    
## famrel      -0.32188    0.12600  -2.555   0.0106 *  
## absences     0.07119    0.01773   4.014 5.96e-05 ***
## sexM         1.05932    0.24610   4.304 1.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 419.25  on 378  degrees of freedom
## AIC: 427.25
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel    absences        sexM 
## -0.59731225 -0.32187856  0.07118883  1.05931782

Fourth, I remove family relationships:

m <- glm(high_use ~ goout + absences + sex, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8180  -0.7872  -0.5325   0.7930   2.4627  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.26004    0.48359  -8.809  < 2e-16 ***
## goout        0.74880    0.12058   6.210 5.30e-10 ***
## absences     0.06601    0.01694   3.897 9.73e-05 ***
## sexM         1.00074    0.25778   3.882 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 381.44  on 378  degrees of freedom
## AIC: 389.44
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)       goout    absences        sexM 
## -4.26004047  0.74880103  0.06601437  1.00073889

The AICs are:

- full model: 381.08
- without sex: 396.31
- without absences: 395.54
- without goout: 427.25
- without famrel: 389.44

The original logistic regression model should be selected, since it has the lowest AIC. Thus, there is no need to remove variables from the model. This should be the most parsimonious model that describes the data adequately.
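Collecting the five AIC values reported above makes the comparison explicit:

```r
# AIC values from the five model summaries above
aic <- c(full = 381.08, no_sex = 396.31, no_absences = 395.54,
         no_goout = 427.25, no_famrel = 389.44)
names(which.min(aic))  # "full"
```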

# find the model with glm()
m <- glm(high_use ~ famrel + goout + absences + sex, data = alc, family = "binomial")

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)

CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.0673333 0.0177991 0.2371555
## famrel      0.6388494 0.4830454 0.8395914
## goout       2.2249985 1.7533774 2.8667377
## absences    1.0653532 1.0335056 1.1030990
## sexM        2.9078813 1.7469330 4.9254515

The odds ratios of going out, absences and sex (male) are above one. This implies that these variables are positively associated with high use of alcohol. On the other hand, good family relationships are negatively associated with high use of alcohol.
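As a reminder of what the code above computes: an odds ratio is the exponential of the corresponding log-odds coefficient, so the OR column can be reproduced directly from the coefficients printed by coef(m):

```r
# coefficients from coef(m) above
b <- c(famrel = -0.44808654, goout = 0.79975625,
       absences = 0.06330634, sexM = 1.06742473)
round(exp(b), 3)  # matches the OR column: 0.639, 2.225, 1.065, 2.908
```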

# fit the model
m <- glm(high_use ~ famrel + goout + absences + sex, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   254   16
##    TRUE     59   53
summary(alc$high_use)
##    Mode   FALSE    TRUE 
## logical     270     112

According to the table, the model correctly predicts non-high alcohol consumption for 254 individuals out of the true 270. The model clearly has a weakness in predicting high alcohol consumption: it correctly identifies only 53 of the 112 individuals whose alcohol consumption is truly high.
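The training error can also be computed by hand from the cross-tabulation above:

```r
# confusion matrix counts from the table above
tab <- matrix(c(254, 59, 16, 53), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
wrong <- tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"]  # 16 + 59 = 75
err <- wrong / sum(tab)
round(err, 3)  # about 0.196
```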

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

The plot visualizes the previous finding: the model is good at predicting individuals who have low alcohol consumption, but it performs worse at predicting high alcohol consumption.

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66492147 0.04188482 0.70680628
##    TRUE  0.15445026 0.13874346 0.29319372
##    Sum   0.81937173 0.18062827 1.00000000

In total, the model predicts 75 individuals wrong (roughly 20% of the sample). The table above is consistent with the earlier table, but now the same information is given as proportions. According to the predictions, 81.9% of individuals are not high users of alcohol and 18.1% are high users. In the sample, 70.7% are not high users and 29.3% are high users. There is an 11 percentage point difference between the share of high alcohol consumers in the sample and in the predictions.

The model performs much better than guessing when predicting low alcohol consumption. However, for individuals with high alcohol consumption the model performs poorly, and it is only slightly better than guessing.

# the logistic regression model m and dataset alc with predictions are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.1963351
# when prob = 0: how often the prediction is wrong
# when prob = 1: how often the prediction is right
# when prob = alc$probability

The training error is 19.6%, which implies that the model predicts roughly 20% of the true values wrong.
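On a toy vector the loss function's logic is easy to see: with class coded as 0/1 (or FALSE/TRUE), abs(class - prob) > 0.5 is TRUE exactly when the predicted probability falls on the wrong side of the 0.5 threshold:

```r
# average prediction error, as defined above
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

class <- c(0, 0, 1, 1)
prob  <- c(0.1, 0.7, 0.9, 0.4)  # hypothetical predicted probabilities
loss_func(class, prob)  # 2 of 4 predictions on the wrong side -> 0.5
```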

# the logistic regression model m and dataset alc (with predictions) are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.1963351
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2198953

The average share of wrong predictions in the training data is 19.6%. The average share of wrong predictions in 10-fold cross-validation is about 0.22, which is clearly lower than in the model introduced in DataCamp (with an error of 0.26).

I create a model with all variables (excluding alcohol use and high use) in the dataset.

m <- glm(high_use ~ school + sex + age + address + famsize + Pstatus + Medu +
           Fedu + Mjob + Fjob + reason + nursery + internet + guardian +
           traveltime + studytime + failures + schoolsup + famsup + paid +
           activities + higher + romantic + famrel + freetime + goout + Dalc +
           Walc + health + absences + G1 + G2 + G3,
         data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   270    0
##    TRUE      0  112
summary(alc$high_use)
##    Mode   FALSE    TRUE 
## logical     270     112
# the logistic regression model m and dataset alc with predictions are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

With this model there are no prediction errors: the training error is zero, since the model fits the training data perfectly. This is presumably because high_use is derived from the alcohol-use variables Dalc and Walc, which are included as explanatory variables; the glm warnings about non-convergence and fitted probabilities of 0 or 1 point to this perfect separation. Let's add more variables to the previous four-variable model: age, studytime, Medu, Fedu, Mjob, Fjob, and G1, G2 and G3:

m <- glm(high_use ~ famrel + age + goout + absences + sex + G1 + G2 + G3 +
           Medu + studytime + Fedu + Mjob + Fjob,
         data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   22
##    TRUE     57   55
summary(alc$high_use)
##    Mode   FALSE    TRUE 
## logical     270     112
# the logistic regression model m and dataset alc with predictions are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2068063
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

The training error is 0.2068063, so compared to the perfectly fitted full model the number of wrong predictions clearly increases.

Remove G1, G2, G3:

m <- glm(high_use ~ famrel + age + goout + absences + sex + Medu + studytime +
           Fedu + Mjob + Fjob,
         data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   246   24
##    TRUE     57   55
summary(alc$high_use)
##    Mode   FALSE    TRUE 
## logical     270     112
# the logistic regression model m and dataset alc with predictions are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2120419
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

Removing these variables increased the number of high-use predictions by 2 for people who are actually not high users. The model performs slightly worse than the model with more explanatory variables, and the training error increases slightly.

Next, remove Medu, Fedu, Mjob and Fjob.

m <- glm(high_use ~ famrel + age + goout + absences + sex + studytime,
         data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   18
##    TRUE     65   47
summary(alc$high_use)
##    Mode   FALSE    TRUE 
## logical     270     112
# the logistic regression model m and dataset alc with predictions are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2172775
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

Now the model is slightly better at predicting low use (252 vs. 246 correct), but worse at predicting high use (47 vs. 55 correct). The training error increases by a very small amount.

m <- glm(high_use ~ famrel + absences, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   259   11
##    TRUE    100   12
summary(alc$high_use)
##    Mode   FALSE    TRUE 
## logical     270     112
# the logistic regression model m and dataset alc with predictions are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2905759
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

The training error increases, and the model clearly yields worse predictions. It seems that removing a couple of variables changes the accuracy of the predictions only slightly, unless the number of explanatory variables becomes very low.
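Training error is an optimistic estimate of predictive performance. Below is a hedged sketch of 10-fold cross-validation with `boot::cv.glm`, run here on simulated data so that it is self-contained (the real analysis would pass the alc data and the fitted model instead):

```r
library(boot)  # cv.glm; ships with base R

set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))  # simulated binary outcome

m_sim <- glm(y ~ x, data = d, family = "binomial")

# average prediction error, as defined earlier
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the raw CV estimate of the error
cv <- cv.glm(data = d, cost = loss_func, glmfit = m_sim, K = 10)
cv$delta[1]
```

The cross-validated error is usually somewhat higher than the training error, which is the honest number to compare models with.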


Clustering and classification

library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(GGally)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, plotly, stats
## lag():    dplyr, stats
## select(): dplyr, plotly, MASS
data('Boston')
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset contains housing values in suburbs of Boston, together with other variables describing the area, e.g. crime rate, accessibility to radial highways and air quality. The data contains 506 observations and 14 variables. All variables are numeric, and only two of them are integers.

The meta text can be found here.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Crime rate seems to vary a lot between areas. Some areas have a low proportion of non-retail business acres per town, so those towns have separated residential areas. Most of the housing is not located by the Charles River. The nitrogen oxides concentration roughly doubles between the minimum and the maximum, indicating that air pollution is concentrated in some areas.

Segregation may be an issue here, since in some areas very few black people live. However, the difference between the first and third quartile is small, so there are only very few areas with very few black residents.

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)

# print the correlation matrix
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Weighted mean of distances to five Boston employment centres is highly negatively correlated with the proportion of owner-occupied units built prior to 1940, nitrogen oxides concentration and the proportion of non-retail business acres per town. Old neighbourhoods are typically residential areas. The air contains more nitrogen oxides in areas with lots of traffic, such as the active employment centres. Areas with non-retail businesses probably offer fewer employment opportunities, and these areas are more likely to be residential.

Median value of owner-occupied homes has a high negative correlation with percentage of lower status people of the population.

Index of accessibility to radial highways is highly positively correlated with full-value property-tax rate, indicating that properties close to highways are more expensive (higher value of the property equals higher tax).
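The readings above can be backed with numbers. Below is a small helper (my own sketch, not part of the course material) that lists the strongest pairwise correlations from a correlation matrix; the built-in mtcars data is used here so the snippet runs on its own, but `cor_matrix` would work the same way:

```r
# List the n strongest pairwise correlations in a correlation matrix.
strongest <- function(cm, n = 3) {
  cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
  ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
  idx <- arrayInd(ord[1:n], dim(cm))              # vector index -> row/col
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r = cm[ord[1:n]])
}

strongest(cor(mtcars))  # demo on a built-in dataset
```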

# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Now the mean of each variable is scaled to zero. It is now easier to see, e.g. from the variable crim, that high crime rates are really concentrated in a small number of areas, since the minimum value and the first quartile are close to the mean but the maximum value is very high. Segregation may be a problem in a couple of areas, since the minimum of the variable black is quite far from the mean, indicating that some residential areas have very few black residents. Some areas are quite far from the employment centres (maximum of dis is 3.9566).
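As a reminder of what `scale()` does, here is a minimal check on a toy vector: each column is centred by its mean and divided by its standard deviation.

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)         # centre and standardize by hand
all.equal(as.vector(scale(x)), manual)  # TRUE
```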

# From matrix to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Now the crime rate is split into four categories based on its quantiles. I split the data into a training set (80% of the data) and a test set (20% of the data) and removed the original crime rate variable, as instructed in exercise 6.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2425743 0.2450495 0.2698020 
## 
## Group means:
##                   zn      indus        chas        nox           rm
## low       1.08083795 -0.8943943 -0.11163110 -0.8800110  0.470091268
## med_low  -0.09084544 -0.2974862  0.04906687 -0.5663270 -0.126221526
## med_high -0.39022525  0.1826612  0.16512651  0.3813835  0.008125402
## high     -0.48724019  1.0169738 -0.01948777  1.0407399 -0.388118352
##                 age        dis        rad        tax     ptratio
## low      -0.9338618  0.9077873 -0.6959263 -0.7592490 -0.50263928
## med_low  -0.3559213  0.3253729 -0.5588134 -0.5204595 -0.05440252
## med_high  0.4435240 -0.3462913 -0.4389595 -0.3198335 -0.19175058
## high      0.8049719 -0.8451343  1.6395837  1.5150965  0.78247128
##               black       lstat         medv
## low       0.3715361 -0.78036789  0.576251686
## med_low   0.3155938 -0.12938995  0.001647569
## med_high  0.1318285  0.01316435  0.096266769
## high     -0.8367362  0.83659614 -0.658639601
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.164389572  0.830481133 -0.92716146
## indus    0.016212860 -0.098096658  0.21570395
## chas    -0.020152805 -0.002890718  0.20181786
## nox      0.358022060 -0.624629773 -1.45044357
## rm      -0.001217507 -0.034059074 -0.09351584
## age      0.275071835 -0.427493886 -0.33109216
## dis     -0.118260639 -0.316701799  0.02399022
## rad      3.327591737  1.201305025  0.07126450
## tax      0.072289762 -0.401866808  0.47066079
## ptratio  0.146361446 -0.006009057 -0.28893602
## black   -0.106148344  0.009681031  0.09780773
## lstat    0.124165435 -0.177824080  0.44044775
## medv     0.052393518 -0.363980200 -0.28121416
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9525 0.0355 0.0120
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The plot shows two clusters that are individually roughly linear; taken together, however, the overall pattern looks quite different. High crime rates are concentrated in one cluster and low rates in the other.

  1. I saved the crime categories from the test set and then removed the categorical crime variable from the test dataset earlier, in part 4.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      15        1    0
##   med_low    3      14       11    0
##   med_high   0       9       16    2
##   high       0       0        0   18

The model predicts high crime rate very well: all 18 observations with a high rate are predicted high, and only two non-high observations (both medium high) are predicted high. With medium high crime rate, the model predicts 16 of 27 observations correctly, 9 as medium low and 2 as high.

With a true value of medium low, the model predicts exactly half (14 of 28) correctly. With low crime rate, the model is correct in fewer than half of the cases (13 of 29), although the failed predictions usually indicate the neighbouring medium low class. Thus, the model is good at predicting high crime rates, but it struggles with the two lowest quartiles.
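The shares above can be computed directly from the confusion table. A short sketch, re-entering the counts printed above by hand so that it runs on its own:

```r
# Confusion table from the LDA predictions (rows = correct, cols = predicted).
tab <- matrix(c(13, 15,  1,  0,
                 3, 14, 11,  0,
                 0,  9, 16,  2,
                 0,  0,  0, 18),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

round(diag(tab) / rowSums(tab), 2)  # per-class accuracy: 0.45 0.50 0.59 1.00
sum(diag(tab)) / sum(tab)           # overall accuracy: 61/102, about 0.60
```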

Load the Boston data again and standardize it:

data('Boston')
# euclidean distance matrix

# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Calculate the distance between observations using Euclidean distance matrix and Manhattan distance matrix:

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Euclidean distance is a good measure of distance here, since it is the standard choice for scaled data.
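A minimal check of the two metrics on a 3-4-5 triangle, to make the difference concrete:

```r
# Two 2-d points: a at the origin, b at (3, 4).
m <- rbind(a = c(0, 0), b = c(3, 4))

dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # Manhattan: |3| + |4| = 7
```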

boston_scaled <- as.data.frame(boston_scaled)

km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled[1:7], col= km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

pairs(boston_scaled[10:14], col = km$cluster)

I compared the plots with different numbers of clusters. Two or three clusters seem reasonable, and I ended up using two since this makes the clusters very clear on the plots.

Looking at the first plot, e.g. the proportion of residential land zoned for lots against the proportion of non-retail business acres per town, the crime rate very often splits along the two clusters. Crimes are centred in areas with a low proportion of residential land zoned for lots. Areas with a high proportion of non-retail business acres have high variation in crime rates, but overall a higher crime rate.
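The comparison of cluster counts can also be done numerically with the total within-cluster sum of squares (the elbow heuristic). A sketch on simulated data, since boston_scaled is not reloaded here:

```r
set.seed(123)
d <- scale(matrix(rnorm(200), ncol = 2))  # 100 simulated 2-d points

# Total WCSS for k = 1..10; look for the 'elbow' where the drop levels off.
twcss <- sapply(1:10, function(k) kmeans(d, centers = k, nstart = 20)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```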

Super bonus:

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

Here the color is defined by the crime classes.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= km$cluster[1:404])

The second plot doesn’t look as I expected. I thought there would be two clusters, as one can see by eye, but the clustering seems quite random. From the first plot we can see that high crime rates are clearly clustered away from the others, indicating that areas with high crime rates have different characteristics from other areas.


Dimensionality reduction techniques

library(ggplot2)
library(GGally)
library(dplyr)
library(corrplot)

Overview of the data

human <- read.table("human.txt", header = TRUE)
summary(human)
##     se.ratio        LFPR.ratio     Expected.Education Life.Expectancy
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40      Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25      1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50      Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18      Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20      3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20      Max.   :83.50  
##     GNI.cap            MMR              ABR               PR       
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ se.ratio          : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ LFPR.ratio        : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Expected.Education: num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Expectancy   : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI.cap           : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MMR               : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ ABR               : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ PR                : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

The data contains eight variables and 155 rows.

The dataset combines data from the Human Development Index and the Gender Inequality Index. These datasets come from the United Nations Development Programme’s Human Development Reports.

The variables correspond to:

  • se.ratio: ratio of females and males with at least secondary education
  • LFPR.ratio: ratio of females and males in the labor force
  • Expected.Education: expected years of education
  • Life.Expectancy: life expectancy at birth
  • GNI.cap: Gross national income per capita
  • MMR: maternal mortality ratio
  • ABR: adolescent birth ratio
  • PR: Percent Representation in Parliament

See the datapage for further information here.

Technical notes can be found here.

ggpairs(human)

cor(human) %>% corrplot

Maternal mortality ratio is positively correlated with adolescent birth ratio, indicating that young mothers face risks when giving birth. Expected education and maternal mortality ratio have a high negative correlation, while expected education and life expectancy are highly positively correlated. Expected education and adolescent birth ratio are highly negatively correlated. Life expectancy is negatively correlated with maternal mortality ratio and adolescent birth ratio. It would be important to study whether education has a negative causal effect on the adolescent birth ratio.

PCA on non-standardized data

# Mean again zero. Check how affects the variance
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

If the data is not standardized, one principal component alone explains almost all of the variation in the data. However, this graph remains uninformative and is not useful for interpretation.
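Why this happens can be demonstrated on toy data: when one variable (GNI per capita in the real data) has a much larger variance than the rest, the first principal component of an unscaled PCA simply follows that variable. A hedged sketch with made-up columns:

```r
set.seed(42)
d <- data.frame(small = rnorm(100, sd = 1),
                huge  = rnorm(100, sd = 1000))  # variance ~1e6 vs ~1

p <- prcomp(d)  # no scaling
summary(p)$importance["Proportion of Variance", ]  # PC1 captures ~100%
```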

Standardize variables

human_std <- scale(human)

pca_human <- prcomp(human_std)

s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = "The dimensionality of human development reduced to two principal components (PC). ")

The first PC captures more than 50% of the variance of the original eight variables.

Interpretation

The ratio of females and males in the labor force is almost orthogonal to all variables except percent representation in parliament, indicating that it has very low correlation with those variables. The same applies to percent representation in parliament.

Maternal mortality ratio is positively correlated with adolescent birth ratio, since their arrows in the plot point in the same direction. These variables are highly negatively correlated with expected years of education, gross national income per capita, the ratio of females and males with at least secondary education, and life expectancy at birth. These last four variables are positively correlated with each other.

All arrows in the plot are quite long, indicating that each variable has a non-zero standard deviation. Percent representation in parliament and the ratio of females and males in the labor force contribute to the second principal axis (PC2), since their arrows point roughly in the same direction as the PC2 axis. The rest of the variables contribute to the first principal axis (PC1). The PC2 axis can therefore be interpreted as how much females participate in parliamentary representation and in the labor force. The PC1 axis is trickier to interpret, since more variables are related to it; it describes the socio-economic conditions of women in a country. Low values on the PC1 axis indicate good conditions for women and high values the opposite. High values on PC2 imply that a large portion of females participate in the labor force and are represented in parliament.

For example, the Nordic countries in the top-right corner of the plot have low values on the PC1 axis and high values on the PC2 axis, indicating good socio-economic conditions and a high participation rate of females in the labour force and parliament. Countries such as Sudan, Mauritania and Afghanistan are the opposite: they have high values on the PC1 axis and low values on the PC2 axis, indicating worse conditions for females compared to the Nordic countries.

Tea drinkers

300 consumers were interviewed for the data. The data contains 300 observations and 36 variables. Consumers answered questions about their tea consumption preferences, where they buy tea and when they consume it, as well as questions about their background, e.g. sex, age and socio-economic status. The data comes with the FactoMineR package for R. More information can be found here.

I selected columns “Tea”, “How”, “how”, “sugar”, “where” and “lunch” for the analysis.

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.4
library(ggplot2)
library(tidyr)
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.4
data(tea)

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped


Most of the tea is consumed from a tea bag, usually without milk or lemon. The most popular tea type is Earl Grey, followed by black tea; only a few people drink green tea. About half of the consumers add sugar. The tea is usually bought from a chain store.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

For dimension 1, the v.test values of the categories alone and other are below 1.96 in absolute value, indicating that these categories do not differ significantly from zero on that dimension. For dimension 2, the v.test of black tea is below 1.96 in absolute value, so black tea is not significantly different from zero on dimension 2.
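This check can be automated by pulling the v.test matrix out of the MCA result object; a sketch, assuming the FactoMineR object `mca` fitted above:

```r
# v.test values for each category on the first two dimensions
v <- mca$var$v.test[, 1:2]

# categories whose coordinate does not differ significantly from zero
# (|v.test| < 1.96) on dimension 1 and on dimension 2
rownames(v)[abs(v[, 1]) < 1.96]
rownames(v)[abs(v[, 2]) < 1.96]
```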

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The distance between variable categories gives a measure of their similarity. Dimension 1 explains 15.24 % of the variation in the variables and dimension 2 explains 14.23 %. These two dimensions do not capture much of the total variation, and they explain roughly equal shares of it.
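The percentages of explained variance quoted above can be read programmatically from the eigenvalue table that FactoMineR stores in the result object:

```r
# eigenvalues and (cumulative) percentages of variance per dimension
head(mca$eig, 3)

# variance explained by the two plotted dimensions together (about 29.5 %)
sum(mca$eig[1:2, "percentage of variance"])
```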

Tea shops and unpackaged tea are closely related, while tea bags are closely related to chain stores. The combined category tea bag+unpackaged is also related to chain stores. Green tea stands apart from the other categories. Milk and sugar seem to be added often to Earl Grey, whereas black tea is often consumed without sugar.

plotellipses(mca, keepvar = 1:5)

The confidence ellipses are small and mostly separated from each other, so the category groups shown on the plot are clearly distinct.